A RISC processor is arranged to reduce the code
size, make the hardware less complicated, execute a
plurality of operations in one machine cycle, and
enhance the performance. The processor is capable
of executing N instructions each having a short word
length for indicating a single operation or an instruc- 
tion having a long word length for indicating M 
(N<M) operations. When the number of operations to 
be executed in parallel is large, the long-word in- 
struction is used. When it is small, the short-word 
instruction is used. A competition between the long-
word instructions is avoided by software and a
competition between the short-word instructions only
is detected by hardware. The simplification of the
hardware brings about improvement of a machine 
cycle, improvement of a code cache hit ratio caused 
by the reduction of a code size and increase of the 
number of operations to be executed in parallel for 
the purpose of enhancing the performance. 
BACKGROUND OF THE INVENTION

The present invention relates to a computer
which is capable of executing operations in parallel,
and more particularly to a computer having a
parallel operating function in which a super
scalar system and a VLIW system are used in a
mixed manner.

Computer architecture has progressed year by
year with the aid of progress in semiconductor
technology. In the 1980s, in place of
the CISC (Complex Instruction Set Computer), which
processes a complicated instruction by using
microinstructions over a plurality of cycles, the RISC
(Reduced Instruction Set Computer) emerged,
which executes a simple instruction in one cycle.

As a higher operating technique, a super scalar 
system and a VLIW (Very Long Instruction Word) 
system have been proposed. 

The super scalar system is a system that detects
a competition between instructions by hardware
when executing the instructions and executes
a plurality of instructions in one machine cycle if
no competition is found. This system has been
described in Japanese Patent Application No.
63-283673 (prior art 1) and in J. Hennessy and D.A.
Patterson, "Computer Architecture: A Quantitative
Approach", Morgan Kaufmann Publishers, Inc.,
1990, p. 318 (prior art 2).

The VLIW system is a system arranged to use
a long instruction having fields for controlling the
operations of two or more operating units. Whereas
a normal RISC processor has an instruction
length of 32 bits, the VLIW system has an instruction
length of 64, 128, 256 or more bits. This
system has been described in the aforementioned
J. Hennessy and D.A. Patterson (prior art 2).

As an improvement of the VLIW system, a
technique has been proposed in which a one-word
instruction and a three-word instruction are
processed by the VLIW system in a mixed manner.
This technique can thus reduce the code size. The
technique is described in Robert Cohn et al.,
"Architecture and Compiler Tradeoffs for a Long
Instruction Word Microprocessor", Third International
Conference on Architectural Support for Programming
Languages and Operating Systems, 1989, p.
2-14 (prior art 3).

SUMMARY OF THE INVENTION 

Hereafter, the features of the super scalar system
and the VLIW system will be described.

The super scalar system has a feature of re- 
ducing a code size because the system indicates 
only an effective operation by the short length 
instruction for indicating a simple operation. 



Further, since no additional instruction is nec- 
essary, the super scalar system can keep a com- 
patibility between the new and the previous 
models. 

On the other hand, the first shortcoming of the
super scalar system is that it is necessary to detect
a competition among operations to be executed in
parallel. As the operations to be executed in parallel
increase in number, the amount of hardware
needed for detecting the competition becomes
larger.

The second shortcoming of this system is that
a complicated check for a competition, and queuing,
are required between an instruction executed before
the current cycle and an instruction to be
executed in the current cycle. As the operations to
be executed in parallel increase in number,
more instructions can compete with the
instruction of the current cycle, and the hardware for
detecting the competition and queuing between
them becomes more complicated.

The third shortcoming of the super scalar system
is that fewer registers can be specified by an
instruction, because the instruction length
is short. A typical number of registers is 16 to
32. As described by J. Hennessy
and D.A. Patterson, p. 325, loop unrolling or
software pipelining can be used as a software
technique for increasing the operations to be
executed in parallel. However, the number of
registers is not sufficient for these techniques. In other
words, such optimization is not possible within
the existing registers.

As a remedy, the writing of the prior art 1
describes, in E-21 to 22, a scheme for improving the
insufficiency of the registers in which an operated
result is inhibited from being immediately reflected
on the next instruction.

It has been described that a super scalar machine
can prefetch data from a main memory to a
cache memory by using an instruction, in
David Callahan et al., "Software Prefetching",
Fourth International Conference on Architectural
Support for Programming Languages and Operating
Systems, 1991, p. 40 to 52.

As described above, in the super scalar system
the first and the second shortcomings, that is,
the complicated check for a competition, become
an obstacle to improving the machine cycle as
the instructions to be executed in parallel
increase in number. Hence, this system is not
capable of enhancing the processing speed very much.

Turning to the VLIW system, this system has a
first feature of specifying a plurality of operations in
a single instruction because the instruction length
is long, thereby eliminating the necessity of checking
by hardware for a competition among operations to be
executed in parallel when executing the instructions.

As a second feature, the VLIW system makes it
possible to specify more registers because the word
length is long.

The VLIW system has a first shortcoming that 
it cannot necessarily specify an effective operation 
to all the fields and thus needs a larger code size. 

As a second shortcoming, the system requires
a complicated check for a competition and queuing
between an instruction executed before the current
cycle and an instruction to be executed in the
current cycle. This is the same as the second
shortcoming of the super scalar system.

To overcome this shortcoming, a technique in which
a compiler avoids a competition in advance, without
having to use hardware for detecting a competition,
has been described in Andrew Wolfe and John P.
Shen, "A Variable Instruction Stream Extension to
the VLIW Architecture", Fourth International Conference
on Architectural Support for Programming
Languages and Operating Systems, 1991, p. 2 to
14.

As a third shortcoming, the VLIW system cannot
keep compatibility with the previous
model. This is because the super scalar system
can execute the conventional one-word instruction
by hardware, while the VLIW system needs to
redefine the instruction.

As described above, no computer has been
proposed that compensates for the shortcomings of
the super scalar system and the VLIW system while
keeping the features of those systems.

It is an object of the present invention to provide
a computer which is capable of executing
operations with the super scalar system and the
VLIW system used in a mixed manner. This means
that the computer has a faster processing speed while
keeping upward compatibility with a computer
having the conventional architecture made of
shorter instructions each indicating a single operation.

In carrying out the object, according to a first
aspect of the invention, the computer is arranged
to have a register, a memory and a program counter
and to provide a parallel operating function of
reading an instruction stored in the memory and
indicated by the program counter and a capability
of executing the operation indicated by the instruction
with respect to the register, the memory and
the program counter, the instruction being a
short-length instruction for indicating a single operation
or a long-length instruction for indicating a plurality
of operations. The computer includes means for
determining if the instruction indicated by the program
counter is a short-length instruction or a long-length
instruction, and means for setting the instruction in
the register if the instruction is determined to be a
long-length instruction and for setting the instruction
to a predetermined register if the instruction is
determined to be a short-length instruction.

According to a second aspect of the invention,
the computer is arranged to provide a register, a
memory, and a program counter and read an
instruction indicated by the program counter and
stored in the memory and provide a capability of 
executing the operation indicated by the instruction 
with respect to the register, the memory and the 
program counter, the instruction being a short- 
length instruction for indicating a single operation 
or a long-length instruction for indicating a plurality 
of operations, means for determining if the instruc- 
tion indicated by the program counter is the short- 
length instruction for indicating a single operation 
or the long-length instruction for indicating a plural- 
ity of operations, means for detecting a competition 
among the short-length instructions, and means for 
setting the instruction to the register if the instruc- 
tion is determined to be the longer one by the word 
length determining means or setting the instruction 
to a predetermined register if the instruction is 
determined to be the shorter one by the word 
length determining means and no competition is 
checked out by the competition checking means. 

According to a third aspect of the invention, the 
computer is arranged to have a register, a memory 
and a program counter, read an instruction in- 
dicated by the program counter from the memory 
and provide a capability of executing an operation 
indicated by the instruction with respect to the 
register, the memory and the program counter, the 
instruction being a short-length instruction for in- 
dicating a single operation or a long-length instruc- 
tion for indicating a plurality of operations, means 
for determining if the instruction indicated by the 
program counter is the short-length instruction or 
the long-length instruction, means for detecting a 
competition among the short-length instructions if 
the instruction is determined to be the short-length 
instruction, and means for executing a predeter- 
mined number of short-length instructions for one 
machine cycle according to the content of the 
competition detecting means if the instruction is 
determined to be the short-length instruction or
executing a predetermined number of long-length
instructions for one machine cycle if the instruction 
is determined to be the long-length instruction. 

According to the invention, the computer is 
capable of executing two or more short-length 
instructions each for indicating a single operation 
for one machine cycle or executing one long-length 
instruction for indicating a plurality of operations in 
parallel for enhancing the performance of the com- 
puter itself. 

According to the invention, the computer uses
the long-length instruction only when many operations
are to be executed in parallel, so as to
eliminate no-operation fields in the long-length
instruction and thereby reduce the code size.
This enhances the utilization efficiency of the
main memory and the cache memory, thereby
improving the processing speed.

According to an embodiment of the invention,
no competition exists among a plurality of operations
indicated by the long-length instruction.
Hence, it is not necessary to detect such a competition
by hardware. The hardware need only
detect the competition among the short-length
instructions executed in the same cycle. According
to the invention, the number of short-length
instructions executed in one machine cycle can be
set to be smaller than the number of operations
indicated by the long-length instruction, so that
a competition among operations to be executed in
parallel is detected more easily even though the
number of operations executed in one machine cycle
is high on average.

According to another embodiment of the inven- 
tion, the computer is capable of generating an 
instruction train by a compiler in a manner to avoid 
the competition between the long-length instruction 
and the previous one. Hence, there is no necessity 
for detecting the competition by hardware. 

According to another embodiment of the invention,
when an effective long-length instruction is executed
after an effective short-length instruction, or an
effective short-length instruction is executed after an
effective long-length instruction, the competition
between them can be resolved in software by
inserting as many nullifying instructions as required
between the two instructions. The hardware just needs to
detect the competition between the short-length
instruction executed before the current cycle and
the long-length instruction to be executed at the
current cycle. According to the invention, therefore,
by making the number of the short-length instructions
executed in one machine cycle smaller than the
number of the operations indicated in the long-length
instruction, the detection of the competition
between the instructions executed before the current
cycle and the instruction executed in the current
cycle is made easier even though the number of
operations executed in one machine cycle is high
on average.

According to another embodiment of the invention,
the operated result indicated by an instruction
is not immediately reflected on the next instruction
but is reflected on an instruction some
instructions later than the next one.
The instructions executed after that instruction and
before the result is reflected read
the value the register held before the result is written
therein. Hence, the number of registers treated by
the software is made substantially larger, which makes
it possible to optimize the software so as to
increase the number of parallel operations.

According to another embodiment of the inven- 
tion, the hardware for detecting the competition is 
made so simple that the machine cycle may be 
improved for enhancing the processing speed. 

According to another embodiment of the inven- 
tion, the long-length instruction for indicating a plu- 
rality of operations is added to the short-length 
instruction having the conventional architecture for 
indicating a single operation for forming a new 
architecture. The upward compatibility is allowed to 
be maintained, because the new architecture is 
arranged to have the conventional architecture. 

These and other objects, features and advantages
of the present invention will be understood
more clearly from the following detailed description
with reference to the accompanying drawings,
wherein:

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is an overall diagram showing an instruc- 
tion control unit; 

Fig. 2 is a view showing a register composition; 
Fig. 3 is a table for explaining an instruction
format;

Fig. 4 is a table for explaining an operation of a
one-word instruction;

Fig. 5 is a table for explaining an operation of a 
four-word instruction; 

Fig. 6 is a view for explaining pipeline stages; 
Fig. 7 is a view showing pipelines for processing 
a one-word instruction when no competition 
takes place; 

Fig. 8 is a view showing pipelines for processing 
a one-word instruction when a competition takes 
place; 

Fig. 9 is a view showing pipelines for processing 
a four-word instruction; 

Fig. 10 is a view showing a data disposition on 
memory; 

Fig. 11 is a view for explaining how a four-word 
instruction is operated; 

Fig. 12 is a table for explaining a program using 
a four-word instruction; 

Fig. 13 is a block diagram showing an embodi- 
ment of an instruction control unit; 
Fig. 14 is a block diagram showing an integer 
operating unit; 

Fig. 15 is a block diagram showing a floating
point operating unit;

Fig. 16 is a block diagram showing a floating 
point register file; 

Fig. 17 is a block diagram showing a floating 
point register; 
Fig. 18 is a circuit diagram showing a one-bit
part of the floating point register;
Fig. 19 is a block diagram showing a floating
point register;

Fig. 20 is a view for explaining an operation of a
shadow register;

Fig. 21 is a view for explaining an operation of 
the shadow register; 

Fig. 22 is a view for explaining an operation of 
the shadow register; 

Fig. 23 is a table for explaining an operation of a 
mode control circuit; 

Fig. 24 is a table for explaining an operation of a
readout control circuit for the register;
Fig. 25 is a block diagram showing a competi- 
tion detecting circuit; 

Fig. 26 is a flowchart showing a compiler; 

Fig. 27 is a circuit diagram showing the mode
control circuit in detail;

Fig. 28 is a circuit diagram showing the detail of 
a data cache; 

Fig. 29 is a view showing pipelines arranged 
when a cache miss takes place in the load store 
operation; 

Fig. 30 is a view showing another embodiment 
of the invention; 

Fig. 31 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 32 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 33 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 34 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 35 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 36 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 37 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 38 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 39 is a view showing the embodiment 
shown in Fig. 30; 

Fig. 40 is a view showing the embodiment
shown in Fig. 30; and

Fig. 41 is a detailed view of Fig. 1. 

DESCRIPTION OF THE PREFERRED EMBODIMENT

Hereafter, the description will be directed to a
preferred embodiment of the invention. In the following
description, detail not relevant to the essence
of the invention is left out.

Fig. 1 is a block diagram showing the overall
computer according to the embodiment of the invention.
As shown, a numeral 1200 denotes a
memory. A numeral 1300 denotes an instruction
cache. A numeral 1303 denotes an instruction control
unit. A numeral 160 denotes an operating unit.
A numeral 150 denotes an instruction length determining
section. A numeral 109 denotes a competition
detector. The instruction control unit 1303
reads an instruction from the instruction cache
through an interface 170 and decodes it for controlling
the operating unit 160 through an interface
180. The operating unit 160 can process a plurality
of operations in parallel. The computer according
to this invention includes a four-byte instruction for
indicating a single operation and a 16-byte instruction
for indicating a plurality of operations. In the
instruction cache 1300, the 16-byte instructions and
the four-byte instructions are stored in a mixed
manner, arranged so as to avoid a competition between
the 16-byte instructions and the four-byte instructions.
The competition detector 109 serves to detect a
competition only among the four-byte instructions.
The instruction control unit 1303 is provided with
the instruction length determining section 150 and
a selector 110, so that the selector 110 ignores
the output of the competition detector 109 when
executing the 16-byte instruction and selects the
operations to be executed in parallel according to the
output of the competition detector 109 when
executing the four-byte instructions. The selector 110
decodes the selected operation and controls the
operating unit 160 through the interface 180. In the
illustrative embodiment, two operating units are
provided. It goes without saying that there may be
more than two.
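
The selection rule described for the selector 110 can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the names Op, competes and select_ops are hypothetical, the competition check is reduced to a simple test on register names, and at most two four-byte instructions are issued per cycle as in the super scalar flow of this embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Op:
    """One operation: the register it writes and the registers it reads."""
    target: Optional[str]
    sources: Tuple[str, ...] = ()


def competes(first: Op, second: Op) -> bool:
    """Hypothetical stand-in for the competition detector 109.

    Reports a competition when the second operation reads or overwrites
    the register written by the first one.
    """
    if first.target is None:
        return False
    return first.target in second.sources or first.target == second.target


def select_ops(is_long: bool, ops: List[Op]) -> List[Op]:
    """Return the operations started in this machine cycle.

    is_long plays the role of the instruction length determining
    section 150.
    """
    if is_long:
        # 16-byte instruction: the program is arranged so that its
        # operations do not compete, so the detector output is ignored.
        return list(ops)
    # Four-byte instructions: issue two only if no competition is found.
    if len(ops) >= 2 and not competes(ops[0], ops[1]):
        return ops[:2]
    return ops[:1]
```

In this sketch the cost of the check grows only with the number of four-byte instructions issued together, not with the number of operations packed into the 16-byte instruction.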

Next, the description will be oriented to the
register arrangement and the instruction format, the
pipelines and the operating timing. Finally, the detail
of the overall arrangement shown in Fig. 1 will
be discussed.

Fig. 2 shows the register arrangement. FR0 to
FR31 denote floating point registers having a 64-bit
length. R0 to R31 denote integer registers having a
32-bit length. For simplifying the description, the
floating point data is assumed to have a 64-bit length,
that is, double precision. An address is assigned to
each 32 bits.

In the illustrative embodiment, the short word 
has a one-word length and the long word has a 
four-word length. 

Fig. 3 shows an instruction format. One word
consists of 32 bits. A basic instruction, a branch
instruction, and a load and store instruction are
one-word instructions. A compound instruction is a
four-word instruction. The basic instruction is an
operation done between the registers. Though the
long-length instruction is set to have a four-word
length, it may be longer or shorter.

In this embodiment, for simplifying the description,
it is assumed that the four-word instruction is
located at four consecutive words aligned on a
four-word boundary. This assumption can be easily
changed.

At first, the basic instruction will be described.
As shown in Fig. 3, an OP field denotes
an operation code. S1 and S2 fields denote the numbers
of two source registers, respectively. A T field
denotes the number of a target register. A CC field
indicates how a flag is set. That is,
the operation denoted by OP is performed on
the registers indicated by S1 and S2, and
the result is written in the register denoted by
T. The detail will be shown in Fig. 4.

Next, the branch instruction will be described.
A d field denotes a displacement. In the branch
instruction, the value of d is added to the program
counter PC.

Next, the load and store instruction will be
described. An F field indicates if the data to be
loaded or stored is floating point data or integer
data. A SIZE field indicates the word length of the data
to be loaded or stored, as shown in Fig. 4. For
integer data, only one word is defined. For
floating point data, two to sixteen words are defined.
As shown in Fig. 4, for an FST instruction,
FR(S1) is written at the R(S2) address. If the SIZE
field indicates sixteen words, FR(S1) to FR(S1 + 7)
are written at sixteen consecutive words starting
from the R(S2) address. For an FLD instruction, the
data at the R(S1) + R(S2) address is written in FR(T).
If the SIZE field indicates sixteen words, the
sixteen consecutive words starting from the
R(S1) + R(S2) address are written in FR(T) to FR(T + 7).
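
The addressing of the FST and FLD instructions can be illustrated by the following Python sketch. It is a model under assumptions that are not stated in the text: the memory is a list of 32-bit words indexed by word address, each 64-bit floating point register occupies two consecutive words with the upper word first, and the function names fst and fld are only illustrative.

```python
WORD_MASK = 0xFFFFFFFF


def fst(mem, fr, r, s1, s2, size_words):
    """FST: write FR(S1), FR(S1+1), ... to consecutive words from address R(S2)."""
    base = r[s2]
    for k in range(size_words // 2):      # two 32-bit words per 64-bit register
        value = fr[s1 + k]
        mem[base + 2 * k] = (value >> 32) & WORD_MASK   # upper word (assumed order)
        mem[base + 2 * k + 1] = value & WORD_MASK       # lower word


def fld(mem, fr, r, s1, s2, t, size_words):
    """FLD: read consecutive words from address R(S1) + R(S2) into FR(T), FR(T+1), ..."""
    base = r[s1] + r[s2]
    for k in range(size_words // 2):
        fr[t + k] = (mem[base + 2 * k] << 32) | mem[base + 2 * k + 1]
```

With size_words set to 16, eight registers are transferred, which corresponds to FR(S1) to FR(S1 + 7) for FST and FR(T) to FR(T + 7) for FLD.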

Next, the four-word compound instruction will
be described with reference to Figs. 3 and 5. This
instruction has a function of indicating seven operations:
a load and store operation indicated by the
I1, I2, IT, SIZE and F fields, an integer operation
indicated by the J1, J2, JT and J fields, a first
floating point operation indicated by the M1, M2
and MT fields, a second floating point operation
indicated by the A1, A2, AT and A fields, a third
floating point operation indicated by the N1, N2
and NT fields, a fourth floating point operation
indicated by the B1, B2, BT and B fields, and a
flow control indicated by the CC, d and N fields.
The detail of each field will be shown in Fig. 5. The
first and the third floating point operations are
multiplications and the second and the fourth floating
point operations are additions or subtractions.
The N field denotes the number of no-operation
cycles to be inserted after this instruction. The way
of using it will be described later.

The integer operation will be described. When
the J field is not 1111, a normal operation is performed as
shown in Fig. 5. When the J field is 1111, data is
prefetched from the memory to the cache memory.
That is, the cache memory is accessed with the
address R(J1) + R(J2), and the data is transferred
from the memory to the cache
memory if the necessary data is not found there.

For the one-word instruction, the operated result
is immediately reflected on the next instruction.
For the four-word instruction, the operated result is
reflected on the third instruction after it. The description
will now be oriented to the pipeline arrangement and the
program which allow this specification to be used.
As shown in Fig. 6, the pipeline arrangement
has five stages of IF, D, E, F and S. At the IF
stage, an instruction is read. At the D stage, the
instruction is decoded. At the E stage, the read
from a register and a partial operation are carried
out. At the F stage, the operation is performed. At the
S stage, the remaining operation and the write of
the operated result to the register are carried out. The
pipeline arrangement holds true for both the integer
operation and the floating point operation.

Fig. 7 shows a flow for processing one-word
instructions according to this embodiment. This is a
super scalar system for processing two instructions
in one machine cycle. The instructions 1 and 2, 3
and 4, 5 and 6, and 7 and 8 are processed in
parallel unless any competition is detected. This
super scalar system is discussed in detail in
Japanese Patent Application No. 63-283673.

In turn, Fig. 8 shows the case where the instruction 3
uses the operated result of the instruction 2. The
E stages of the instructions 3 and 4 are extended
until the S stage of the instruction 2 is terminated.
In order to meet the instruction specification of
reflecting the operated result of the previous
instruction on the current instruction, the competition
is detected by hardware for performing the operation
shown in Fig. 8.

Fig. 9 shows how the four-word instruction is
processed. One four-word instruction is processed
in each machine cycle. The operated result of
the instruction 1 is reflected not on the instructions
2 and 3 but on the instruction 4 according to the
foregoing specification. The S stage where the result of
the instruction 1 is written to the register has been
terminated one cycle before the D stage where the
instruction 4 is read from the register. Hence, it is
unnecessary to control the competition by hardware
as described with respect to Fig. 8. According
to this embodiment, the operating stages are three
of E, F and S. In general, if the number N of
instructions to be executed before an operated
result is written and the number M of the operating
stages of the pipeline meet the relation
N ≥ M - 1, it is not necessary to
control the competition by hardware. This embodiment
concerns the case where N = 2 and
M = 3 are given.
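
The timing argument above can be checked with a small model. The following Python sketch assumes that one instruction is issued per machine cycle without stalls, that the IF and D stages precede M operating stages, that a register is read in the first operating stage and written in the last one; these assumptions and the function names are illustrative and are not taken from the drawings.

```python
STAGES = ["IF", "D", "E", "F", "S"]


def stage_cycle(instr_index: int, stage: str) -> int:
    """Cycle in which instruction instr_index (1-based) occupies a stage
    of the five-stage pipeline, with one instruction issued per cycle."""
    return instr_index + STAGES.index(stage)


def needs_hardware_interlock(n_delay: int, m_stages: int) -> bool:
    """True if the first instruction allowed to see a result could read
    the register before the producing instruction has written it.

    n_delay  : instructions following the producer that do not see
               its result (N in the text)
    m_stages : number of operating stages (M in the text)
    """
    write_cycle = 2 + m_stages           # producer is instruction 1
    read_cycle = (n_delay + 2) + 2       # consumer is instruction N + 2
    return not (write_cycle < read_cycle)
```

With N = 2 and M = 3, the producer writes in cycle 5 and the instruction 4 reads in cycle 6, so no interlock is needed in this model; with N = 1 the read and the write would fall in the same cycle.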

It is necessary to place two no-operation four-word
instructions between a one-word instruction
and the next effective four-word instruction. Likewise,
it is also necessary to place two no-operation
four-word instructions between an effective four-word
instruction and the next one-word instruction.

Next, a preferable program for processing
this four-word instruction will be described with
reference to Figs. 10, 11 and 12. Consider that the
following calculation is to be performed.

A(i) = A(i) + C × B(i), 1 ≤ i ≤ 24

where C denotes a constant and A(i) and B(i)
denote 64-bit floating point data located on the
memory as shown in Fig. 10.

Fig. 11 is a view for explaining what type of
operation is done in each cycle for calculating A(i).
In Fig. 11, the axis of abscissa denotes time,
the unit of which is a machine cycle. The
elongate boxes shown indicate the three stages E, F and S
of an operating unit through which the processed
data passes. The following steps (1) to (10) will be
described. The operations are performed on four
values of the index i at a time. Those steps (1) to (10) are for
calculating A(9) to A(12). Hereafter, each processing
step will be described. The constant C is assumed
to be held in FR31.

(1) Load A(9) to A(12) to FR4 to FR7 and B(9) to
B(12) to FR0 to FR3.

(2) Store FR0 × FR31 in FR8.

(3) Store FR1 × FR31 in FR9.

(4) Store FR4 + FR8 in FR12.

(5) Store FR5 + FR9 in FR13.

(6) Store FR2 × FR31 in FR10.

(7) Store FR3 × FR31 in FR11.

(8) Store FR6 + FR10 in FR14.

(9) Store FR7 + FR11 in FR15.

(10) Store FR12 to FR15 in A(9) to A(12).
Regarding the operation scheduling of the steps
(1) to (10), it is assumed that three cycles are needed
for performing one process, as has been described
with respect to Fig. 10. The description has
concerned the processes of A(9) to A(12), but
similar processing is performed for A(13) to A(16)
and A(17) to A(20). Each operating unit is
pipelined at a one-cycle pitch. The processes of
A(13) to A(16) and A(17) to A(20) are overlapped
with the processes of A(9) to A(12). Hence, the
processing can be executed as shown in
Fig. 11.
Fig. 12 shows a four-word instruction train for
implementing the process shown in Fig. 11. The
operations of A(1) to A(24) can be implemented by
22 instructions, the instructions 1 to 22, as
shown in Fig. 12. Seventeen registers,
FR0 to FR15 and FR31, are used. Fig. 10 shows
which data is loaded by the instructions 1, 3, 5, 7,
11, 13 and 15 and which data is stored by the
instructions 12, 14, 16, 18, 22 and 22. The instruction
3 is executed to write values in FR0 to FR3,
but the operated result is reflected on the instruction
6 or later. Hence, the values of FR0 to FR3
loaded by the instruction 1 can still be used
by the instruction 4. If the conventional system of
directly reflecting the operated result of the instruction
1 on the instruction 2 were used for performing
the process shown in Fig. 11, the instruction 3
could not be executed to write values in FR0 to
FR3. This means that new registers such as FR16 to
FR19 would be required. However, the number of
usable registers is limited, and the shortage of
registers disadvantageously increases the number of
processing cycles. The program shown in Fig.
12 needs only seventeen registers for implementing
the operation, because the delay writing operation
is executed to reflect the operation result of
the current instruction on the fourth instruction from
the current one.

The delay writing operation provides an effect 
of substantially increasing the number of usable 
registers without having to increase the fields of the 
operation code for specifying the registers. 
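
The effect of the delay writing operation can be illustrated with a small software model. The following Python sketch is an assumption-laden illustration, not the shadow register circuit of the embodiment: the class name DelayedRegisterFile and its methods are hypothetical, and the delay is expressed as the number of following instructions that still see the old value.

```python
class DelayedRegisterFile:
    """Register file in which a written value becomes visible only
    after `delay` further instructions have been executed."""

    def __init__(self, size: int, delay: int):
        self.visible = [0] * size
        self.delay = delay
        self.pending = []            # entries of [remaining, reg, value]

    def read(self, reg: int) -> int:
        return self.visible[reg]

    def write(self, reg: int, value: int) -> None:
        self.pending.append([self.delay, reg, value])

    def end_instruction(self) -> None:
        """Call once after each instruction: commit the writes that are
        due and age the others by one instruction."""
        still_pending = []
        for remaining, reg, value in self.pending:
            if remaining == 0:
                self.visible[reg] = value
            else:
                still_pending.append([remaining - 1, reg, value])
        self.pending = still_pending
```

With delay set to 2, a value written by the instruction 3 is first seen by the instruction 6, so the instruction 4 can still read the value written by the instruction 1 from the same register number. With delay set to 0, the model behaves like the conventional system in which the next instruction sees the result.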

In Fig. 12, a mark "X" indicates an empty field,
where there exists no data to be operated on. Such
empty fields are often found in the instructions 1 to
6 issued when starting the processing and the
instructions 17 to 22 issued when terminating the
processing. However, such empty fields can
be decreased by overlapping the starting processes
of one series with the terminating processes of the
current series. Further, it is possible to remove
the instructions 2, 4 and 21, which have nothing to
execute, by setting each N field of the instructions 1, 3 and
20 to "01".

According to the invention, another special
instruction is used for specifying a register where
the operated result is written. Hence, the
instructions 4 and 21 are allowed to be removed. By
contrast, in a system in which the register where
the operated result is written is specified by the
instruction issued at the stage where the writing
is done, the instructions 4 and 21 are needed to
specify the registers where the operated results of
the instructions 1 and 18 are written. It means
that the instructions 4 and 21 cannot be removed.

When a one-word instruction is issued before the
program shown in Fig. 12, it is necessary to insert
two four-word no-operation instructions before the
instruction 1 shown in Fig. 12. Alternatively, only
one no-operation four-word instruction having an N
field set to "01" needs to be inserted. When a
one-word instruction is issued after the program
shown in Fig. 12, the N field of the instruction 22
shown in Fig. 12 is simply set to "10".

By making the instruction longer and by providing
and making use of the N field, it is possible to
reduce the code size. The one-word instructions are
capable of indicating only four operations by four
words, while the four-word instruction is capable
of indicating seven operations by four words as
shown in Fig. 5.

Next, how to create a program will be described.
The program is described in a high-level language
such as FORTRAN or C and then is transformed into
an instruction train by a compiler.

Fig. 26 shows a processing flow of the compiler
according to the invention. The program written in
the high-level language is converted into an
intermediate code through a lexical analyzing unit,
a syntax analyzing unit, and a semantic analyzing
unit. The intermediate code is optimized by an
optimizing unit and then is converted into an
instruction train as shown in Fig. 3 in a code
generating unit. The optimizing unit and the code
generating unit together compose a synthesizing
unit on which the feature of this invention is
placed. That is, this synthesizing unit provides a
paralleling section for generating such an
instruction train as allows as many operations as
possible to be done in parallel in light of the
intermediate code. This paralleling section uses a
four-word instruction if the number of operations
to be executed in parallel is large or a one-word
instruction if the number of such operations is
small. Herein, the judging criterion as to whether
the used instruction is a four-word one or a
one-word one is defined by the number of operations
to be executed in parallel. This threshold number
depends on the system. When creating a program,
this number can be set as a parameter to the
compiler. The arrangement as described above makes
the code size smaller, the using efficiency of the
main memory and the cache memory higher, and the
processing speed faster.
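
The judging criterion of the paralleling section
may be sketched as follows. This is only an
illustrative sketch: the function name
select_formats, the representation of an operation
group as a list of strings, the default threshold
of 3 and the "greater than or equal" comparison are
all assumptions, since the text states only that
the threshold is a compiler parameter.

```python
def select_formats(groups, threshold=3):
    """Pick an instruction format for each group of operations that
    can be executed in parallel.

    threshold stands for the compiler parameter; a group holding at
    least this many operations is emitted as a four-word instruction
    (assumed rule), otherwise as one-word instructions.
    """
    plan = []
    for ops in groups:
        fmt = "four-word" if len(ops) >= threshold else "one-word"
        plan.append((fmt, list(ops)))
    return plan


plan = select_formats([
    ["load", "fmul", "fadd", "store"],  # many parallel operations
    ["add"],                            # a single operation
    ["add", "fmul"],                    # two operations
])
```

Changing the threshold argument changes where the
compiler switches between the two formats without
touching the rest of the code generation.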

In turn, the description will be oriented to
another feature of the synthesizing unit, that is,
allocation of the register. The four-word
instruction needs special consideration, because
the operated result is reflected on the fourth
instruction from the current instruction. For
example, since the operated result of the
instruction 1 is reflected on the instruction 4 for
the first time, it is economical to allocate to the
instructions 2 and 3 as many operations as
possible. At this time, when two or more single
instructions are operated, a no-operation
instruction generating unit serves to insert a
no-operation instruction. The synthesizing unit is
required to insert two no-operation four-word
instructions between the effective four-word
instruction and the next one-word instruction.
Conversely, it is necessary to insert two
no-operation four-word instructions between the
one-word instruction and the next effective
four-word instruction. Herein, the no-operation
instruction can be removed by setting the N field
as mentioned above. That is, the compiler according
to this embodiment serves to detect a competition
between long instructions and specify the number of
no-operation cycles to be inserted after executing
an instruction by using the N field. Hence, the
hardware is not required to detect or process a
competition between long instructions.
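
The removal of no-operation instructions by means
of the N field may be sketched as follows. The
sketch assumes that the N field is a two-bit count
(so that at most three no-operation cycles can be
folded into one instruction) and represents each
four-word instruction by a string, with "NOP"
standing for a no-operation four-word instruction;
the function name fold_nops and these
representations are hypothetical.

```python
def fold_nops(instructions, max_n=3):
    """Fold no-operation four-word instructions into the N field of the
    preceding effective instruction.

    Returns a list of (instruction, n_field) pairs, where n_field is the
    number of no-operation cycles to insert after the instruction,
    written as a two-bit string such as "01".
    """
    folded = []  # list of [instruction, n] entries
    for ins in instructions:
        can_fold = (
            ins == "NOP"
            and folded
            and folded[-1][0] != "NOP"
            and folded[-1][1] < max_n
        )
        if can_fold:
            folded[-1][1] += 1        # absorb this NOP into the N field
        else:
            folded.append([ins, 0])   # keep the instruction as it is
    return [(ins, format(n, "02b")) for ins, n in folded]


train = fold_nops(["I1", "NOP", "I3", "NOP", "I5"])
tail = fold_nops(["I22", "NOP", "NOP"])
```

A no-operation instruction that has no effective
instruction before it is left in place, because
there is no N field into which it could be folded.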

Next, the description will be oriented to hardware
for processing the instructions described above
according to an embodiment of the invention.
Fig. 13 shows the overall detailed arrangement of
Fig. 1. A numeral 1300 denotes an instruction
cache. A numeral 1301 denotes an instruction cache
controller. A numeral 1302 denotes a branch unit
for controlling an instruction processing flow. A
numeral 1303 denotes an instruction control unit
for decoding an instruction. A numeral 1304 denotes
an integer operating unit. A numeral 1307 denotes a
floating point operating unit. A numeral 1306
denotes a data cache. A numeral 1305 denotes a data
cache controller. A numeral 1308 denotes a memory
interface unit.

The instruction control unit 1303 accepts an
instruction to be executed from the instruction
cache 1300 through a bus 1310 and decodes the
instruction. Then, the unit 1303 serves to send a
control signal 1318 for the integer operating unit
to the integer operating unit 1304, a control
signal 1314 for the floating point operating unit
to the floating point operating unit 1307, and a
control signal 1312 for the branch unit to the
branch unit 1302. Further, the unit 1303 serves to
send out a mode signal 110 to the branch unit 1302
for controlling the program counter 3500. The unit
1303 accepts a flag 1317 from the integer operating
unit 1304 and a flag 1315 from the floating point
operating unit 1307.

The integer operating unit 1304 sends out an
operand address 1319 to the data cache 1306 and the
data cache controller 1305. The data read from the
data cache is sent to the integer operating unit
1304 or the floating point operating unit 1307
through the data bus 1320. If no desired data is
found in the data cache, the data cache controller
1305 issues an interface signal 1321 for starting
the memory interface unit 1308 so that it may read
data from the main memory. The controller 1305
controls the queuing for this operation together
with the instruction control unit 1303 through the
effect of the signal 1316.

The branch unit sends out an address 1309 of a next
instruction to be read out to the instruction cache
1300 and the instruction cache controller 1301. If
no desired instruction is found in the instruction
cache 1300, the instruction cache controller 1301
issues an interface signal 1313 for starting the
memory interface unit 1308 so that it may read a
desired instruction from the main memory. The
controller 1301 controls the queuing for this
operation together with the instruction control
unit 1303 through the effect of a signal 1311.

The detail of the integer operating unit 1304 is
shown in Fig. 14. A numeral 1400 denotes a decoder.
A numeral 1401 denotes a first ALU (Arithmetic and
Logic Unit). A numeral 1402 denotes a second ALU. A
numeral 1403 denotes an integer register file. The
first ALU accepts data from the integer register
file 1403 through source buses 1406 and 1407 and
gives back the operated result to the integer
register file 1403 through a target bus 1322. The
second ALU accepts data from the integer register
file 1403 through source buses 1408 and 1409 and
gives back the operated result to the integer
register file 1403 through a target bus 1319.
1317-1 denotes a flag output from the first ALU.
1317-2 denotes a flag output from the second ALU.
Numerals 1319 and 1322 denote buses which are led
to the data cache 1306 as an address when executing
load, store and prefetch operations.

Fig. 15 shows the detail of the floating point 
operating unit 1307 shown in Fig. 13. A numeral 
1501 denotes a decoder. A numeral 1502 denotes 
a floating point register file. A numeral 1503 de- 
notes a first multiplier. A numeral 1504 denotes a 
second multiplier. A numeral 1505 denotes a first 
adder. A numeral 1506 denotes a second adder. 
The floating point register file 1502 serves to send 
out data to the first multiplier 1503 through source 
buses 1517 and 1518, the second multiplier 1504 
through source buses 1515 and 1516, the first 
adder 1505 through source buses 1513 and 1514, 
the second adder 1506 through source buses 1511 
and 1512. Each operated result is given back to the
floating point register file through target buses
1507, 1508, 1509 and 1510 for writing it therein.

1315-1 denotes a flag of the first multiplier 
1503. 1315-2 denotes a flag of the second multi- 
plier 1504. 1315-3 denotes a flag of the first adder 
1505. 1315-4 denotes a flag of the second adder 
1506. 

Fig. 16 shows the detail of the floating point 
register file 1502. Numerals 1600 to 1608 denote 
floating point registers. Numerals 1314-1 to 1314-9 
denote control signals of the floating point registers 
1600 to 1608, respectively. A numeral 1610 denotes
a load aligner. A numeral 1609 denotes a store
aligner. Numerals 1611 to 1618 denote buses
connecting between the floating point registers
1600 to 1608 and memories. The bus 1611 is
connected to the registers FR0, 8, 16 and 24 and
the bus 1612 is connected to the registers FR1, 9,
17 and 25. So are the buses 1613, 1614, 1615, 1616
and 1617. The bus 1618 is connected to the FR7, 15,
23 and 31. When executing the load instruction, the
data sent through the bus 1320 is put on a desired
one of the buses 1611 to 1618 through the effect of
the load aligner 1610 and then is written in a
desired register. When executing the store
instruction, the data is read from the register to
the buses 1611 to 1618 and then is output to a
desired location of the bus 1320 through the effect
of the store aligner 1609.
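
The connections listed above (FR1, 9, 17 and 25 on
the bus 1612; FR7, 15, 23 and 31 on the bus 1618)
are consistent with selecting the bus by the
register number modulo 8. The following sketch
expresses that reading; the function name
bus_for_register is hypothetical and the modulo
rule is an inference from the listed connections,
not a statement taken from Fig. 16.

```python
def bus_for_register(n):
    """Return the number of the bus (1611 to 1618) assumed to serve the
    floating point register FRn, for n in 0..31."""
    if not 0 <= n <= 31:
        raise ValueError("register number must be in 0..31")
    return 1611 + (n % 8)   # eight buses, each shared by four registers


# Registers grouped by the bus they would share under this rule.
sharing = {}
for n in range(32):
    sharing.setdefault(bus_for_register(n), []).append(n)
```

Under this rule each bus serves four registers, so
the load aligner 1610 and the store aligner 1609
would only need to choose one of eight buses.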

Fig. 17 shows a first embodiment of the floating
point register 1600 shown in Fig. 16. 1601 to 1608
are the same as the register 1600. As shown in
Fig. 17, the register 1600 is a set of 64 one-bit
registers. Numerals 1700 to 1763 denote the one-bit
registers, respectively. Numerals 1511-00 to
1518-00 denote readout buses of the register 1700.
Numerals 1507-00 to 1510-00 denote write buses of
the register 1700. A numeral 1611-00 denotes a read
and write bus of the register 1700. So is the bus
arrangement of the register 1763.

Fig. 28 shows the detail of the data cache shown in
Fig. 13. A numeral 2801 denotes a data array for
holding data. A numeral 2800 denotes an address
array used for a load and store operation. A
numeral 2802 denotes an address array for
prefetching. The address arrays 2800 and 2802 hold
the same data. When executing a load and store
one-word instruction, the address array 2800 and
the data array 2801 are accessed through the buses
1319 and 1322. When executing a load and store
four-word instruction, the address array 2800 and
the data array 2801 are accessed through the bus
1322. The address array 2802 is accessed for
prefetching through the effect of the bus 1319.

Fig. 29 shows a pipeline arrangement for when a
cache miss takes place in the load and store
operation. The pipeline is locked while the data is
transferred from the memory to the cache memory.
In Fig. 29, <t> denotes a locking period.

On the other hand, when performing a prefetching
operation, nothing is done if the address array
2802 is hit. If a miss appears for an address, the
block containing the address is transferred from
the memory to the data array 2801 through a bus.
During the period, the pipeline is not locked. By
setting a prefetching operation before a load and
store operation having a miss-occurring
possibility, the transfer of the data from the
memory to the cache memory is carried out in
parallel to another operation. It is therefore
possible to avoid lowering of the performance
resulting from the locked pipeline.

Fig. 18 shows a circuit arrangement of the register
1700 shown in Fig. 17. Numerals 1816 and 1817
denote inverters. Numerals 1802 to 1815 denote
clocked inverters. When the control signals 1314-1
to 1314-8 go up to a high level, the registers
output their values to the buses 1511-00 to
1518-00. When the signals 1314-1-10 to 1314-1-14 go
up to a high level, the values on the buses 1510-00
to 1507-00 are written in the registers. When the
signal 1314-1-9 goes up to a high level, the value
is output from the register to the bus 1611-00.
When the signal 1314-1-10 goes up to a high level,
the value on the bus 1611-00 is written in the
register. A signal 1800 corresponds to an extra
readout port and a signal 1801 corresponds to an
extra write port. The way of use of these signals
will be described later.

Fig. 19 shows a second embodiment of the floating
point register 1600 shown in Fig. 16. The
embodiment shown in Fig. 19 is different from that
shown in Fig. 17 in the respect that first shadow
registers 1900 to 1963 and second shadow registers
2000 to 2063 are added. The first shadow register
1900 serves to pass the signal 1800 so that it can
read a value stored in the register 1700. Further,
the register 1900 serves to pass the signal 1964 so
that it can send out a value to the second shadow
register 2000. The second shadow register 2000
serves to send out its value to the register 1700
through the signal 1801. That is, the registers
1700 to 1763, the first shadow registers 1900 to
1963 and the second shadow registers 2000 to 2063
compose a ring-like shift register. Like the
registers 1700 to 1763, the first shadow registers
1900 to 1963 and the second shadow registers 2000
to 2063 are capable of reading and writing data
through the buses 1611-00 to 1611-63. A numeral
1314-1-15 denotes a control signal of the first
shadow registers 1900 to 1963. A numeral 1314-1-16
denotes a control signal of the second shadow
registers 2000 to 2063.

The object of the shadow register is to allow
return from an interrupt when executing a four-word
instruction. How the shadow register operates will
be described with reference to Figs. 20 to 22. A W'
stage is a stage at which a value is written from
the register to the first shadow register FRS1. A
W'' stage is a stage at which a value is written
from the shadow register FRS1 to the second shadow
register FRS2.

Fig. 20 shows an operation of a four-word
instruction at a normal time when no interrupt
takes place. Figures on the time charts FR, FRS1
and FRS2 indicate the instructions about which the
operated results are stored in the registers,
respectively. As shown in Fig. 20, at a normal
time, the operated result is shifted from the FR to
the FRS1 and from the FRS1 to the FRS2 at one cycle
pitch.

Fig. 21 shows an operation appearing when an
interrupt takes place between the instructions 3
and 4. The instructions 4, 5, 6 and 7 are
nullified. Each register stops updating of a value
after an interrupt takes place. The FR holds an
operated result of the instruction 3. The FRS1
holds an operated result of the instruction 2. The
FRS2 holds an operated result of the instruction 1.
An interrupt vector is set to the program counter.
The interrupt processing program starting from the
interrupt vector operates to save the values of the
FR, the FRS1 and the FRS2 in the memory.

Fig. 22 is a view for explaining a returning
operation from the interrupt processing. At the
final stage of the interrupt processing program, as
shown in Fig. 22, the operated result of the
instruction 1 is returned to the FR. The operated
result of the instruction 2 is returned to the
FRS2. The operated result of the instruction 3 is
returned to the FRS1. At an E stage of the
instruction 4 for reading data from the register,
the operated result of the instruction 1 is allowed
to be viewed. After terminating the E stage of the
instruction 4, the value of the FR is copied to the
FRS1, the value of the FRS1 is copied to the FRS2
and the value of the FRS2 is copied to the FR. As a
result, at an E stage of the instruction 5, the
operated result of the instruction 2 is allowed to
be viewed. After terminating the E stage of the
instruction 5 for reading data from the register,
at the E stage of the instruction 6 for reading
data from the register, the operated result of the
instruction 3 is allowed to be viewed by the
similar operation. The subsequent process is the
same as the normal process. That is, each time one
instruction is executed, the value is copied from
the FR to the FRS1 and from the FRS1 to the FRS2.
The value of the FRS2 is discarded.
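
The returning operation can be traced with a small
sketch of the ring-like shift register. It assumes
that, at the end of the interrupt processing
program, the FR holds the result of the instruction
1, the FRS2 holds the result of the instruction 2
and the FRS1 holds the result of the instruction 3,
which is the placement for which the ring copy
yields the described sequence. The class name
ShadowRing and the strings used as register
contents are hypothetical.

```python
class ShadowRing:
    """Toy model of one register FR with its two shadow registers."""

    def __init__(self, fr, frs1, frs2):
        self.fr, self.frs1, self.frs2 = fr, frs1, frs2

    def rotate(self):
        """Ring copy used during the return from an interrupt:
        FR -> FRS1, FRS1 -> FRS2, FRS2 -> FR, all at the same time."""
        self.fr, self.frs1, self.frs2 = self.frs2, self.fr, self.frs1


# State restored at the final stage of the interrupt processing program.
ring = ShadowRing(fr="result 1", frs1="result 3", frs2="result 2")

viewed = [ring.fr]       # E stage of the instruction 4
ring.rotate()
viewed.append(ring.fr)   # E stage of the instruction 5
ring.rotate()
viewed.append(ring.fr)   # E stage of the instruction 6
```

The three values read from the FR appear in the
order of the instructions 1, 2 and 3, so that the
instructions 4, 5 and 6 view the same results as
they would have viewed without the interrupt.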

As described above, the shadow registers are 
provided to accept an interrupt when executing a 
delay writing instruction and return to the original 
stage. The lack of the shadow registers allows only 
the operated result of the instruction 3 to be saved 
as shown in Fig. 21. It means that the instruction 4 
does not have any function of viewing the operated 
result of the instruction 1 when returning to the 
original stage from the interrupt as shown in Fig.
22. This is because the instructions 2 and 3 may
instruct to write the values in the same register as 
the instruction 1. For example, in the program 
shown in Fig. 12, the instruction 3 instructs to write 
the data in the same register as the instruction 1. 

The increase of hardware resulting from the
addition of the shadow register will now be
described. The size of the register is
substantially proportional to the number of ports.
As is obvious from the comparison between Figs. 17
and 19, the number of the ports provided in the
shadow register is 3, which is far smaller than the
number of the ports provided in the register, that
is, 13. Hence, the increase of the hardware
resulting from the addition of the shadow register
is negligible.

Fig. 41 shows an embodiment of the instruction
control unit 1303 shown in Fig. 13. A numeral 150
denotes an instruction word length determining
unit. A numeral 101 denotes a first instruction
register. A numeral 102 denotes a second
instruction register. A numeral 103 denotes a third
instruction register. A numeral 104 denotes a
fourth instruction register. A numeral 4100 denotes
a mode register. A numeral 100 denotes a mode
control circuit. A numeral 105 denotes a register
reading control circuit. A numeral 106 denotes a
register writing control circuit. A numeral 107
denotes a function control circuit. A numeral 108
denotes a pipeline control circuit. A numeral 109
denotes a competition detecting circuit.

It is assumed that the four-word instruction is
located in a manner not to bridge a border between
the adjacent four words and that the one-word
instructions are executed two at a time, each pair
lying within a two-word border. In this embodiment,
the instruction word length determination is
carried out by viewing the leftmost bit of an
operation code, that is, the signal 1310-1-1 (C000)
itself of Fig. 41.

Fig. 27 shows the detail of the mode control
circuit 100. In Fig. 27, a numeral 2700 denotes a
control circuit. A numeral 2701 denotes a register
for holding an N field 1310-4-1. A numeral 2702
denotes a decrementer. A numeral 2703 denotes a
comparator. The comparator 2703 serves to send out
an output signal VALID (2704) to the control
circuit 2700. The value of the N field set to the
register 2701 is decremented by one at each cycle
by the decrementer 2702. When the N field reaches
00, the signal VALID (2704) is asserted. The signal
VALID indicates insertion of a no-operation cycle
when it is negated and execution of an instruction
when it is asserted.
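
The behavior of the register 2701, the decrementer
2702 and the comparator 2703 may be sketched as
follows. The exact cycle at which the comparison is
made relative to the decrement is not given in the
text; the sketch assumes that the comparison is
made first, so that an N field of n produces n
no-operation cycles. The function name valid_trace
is hypothetical.

```python
def valid_trace(n_field, cycles):
    """Return the value of VALID for each of `cycles` cycles following a
    four-word instruction whose N field equals n_field."""
    counter = n_field            # value loaded into the register 2701
    trace = []
    for _ in range(cycles):
        trace.append(counter == 0)   # comparator 2703: VALID when N is 00
        if counter > 0:
            counter -= 1             # decrementer 2702
    return trace


two_nops = valid_trace(2, 4)   # N field "10"
no_nops = valid_trace(0, 2)    # N field "00"
```

Each cycle in which VALID is negated corresponds to
one inserted no-operation cycle.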

The control circuit 2700 serves to check an output
116 of the competition detecting circuit (BUB), a
lower 32nd bit of the instruction address 1309-1
(CA30), a bit 1310-1-1 (C000) indicating whether or
not the instruction consists of four words in the
operation code, and a signal 2704 (VALID). Then,
the control circuit 2700 determines the current
mode out of the five modes. As shown in Fig. 23,
the control circuit 2700 sets an operation code to
the first to the fourth instruction registers and
issues the signal 110 for incrementing the program
counter.

The signal 110 indicating a mode to which the
current cycle belongs is latched by the mode
register 4100. The mode register 4100 serves to
supply a signal 130 to the register reading control
circuit 105, the register writing control circuit
106, the function control circuit 107, the pipeline
control circuit 108, and the competition detecting
circuit 109. As shown in Fig. 23, C0 to C3 denote
the four words within the four-word border, ranged
from the smallest address in the order C0, C1, C2,
C3. From the leftmost bit C000 of C0, as shown in
Fig. 3, it is possible to determine if the
instruction is a one-word one or a four-word one. A
one-word instruction mode 1 is a mode at which the
two left instructions (C0, C1) inside of the
four-word border are executed. At this mode, C0 is
set to the first instruction register, C1 is set to
the second instruction register, and the program
counter PC is incremented by +2. A one-word
instruction mode 2 is a mode at which the two right
instructions (C2, C3) inside of the four-word
border are executed. At this mode, C2 is set to the
first instruction register, C3 is set to the second
instruction register, and the program counter PC is
incremented by +2. That is, when executing a
one-word instruction, the first and the second
instruction registers are made operative, while the
third and the fourth instruction registers are not
made operative. A four-word instruction mode is a
mode at which a four-word instruction (C0, C1, C2,
C3) is executed. At this mode, C0 to C3 are set to
the first to the fourth instruction registers and
the program counter PC is incremented by +4. A
competition mode appears when the competition
detecting circuit 109 detects a competition. At
this mode, the first to the fourth instruction
registers and the mode register 4100 hold the value
of the previous cycle. Further, at this mode, no
update of the program counter PC is performed. A
no-operation instruction mode appears when
insertion of a no-operation instruction (NOP) to
the current cycle by hardware is indicated by the N
field of a four-word instruction executed before
the current cycle. The no-operation instruction is
set to the instruction register and the program
counter PC is not updated. As a result, one
no-operation cycle is inserted.
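
The five modes and the corresponding increments of
the program counter may be summarized by the
following sketch. The order in which the conditions
are tested and the use of boolean arguments in
place of the signals BUB, VALID, C000 and CA30
(whose polarities are not stated here) are
assumptions; the function name select_mode is
hypothetical.

```python
def select_mode(competition, valid, is_four_word, right_pair):
    """Return (mode, program counter increment) for the current cycle.

    competition  -- stands for the output 116 (BUB)
    valid        -- stands for the signal VALID (2704)
    is_four_word -- word length decoded from the bit C000
    right_pair   -- True when the address bit CA30 selects (C2, C3)
    """
    if competition:
        return ("competition", 0)        # registers hold, PC not updated
    if not valid:
        return ("no-operation", 0)       # NOP inserted, PC not updated
    if is_four_word:
        return ("four-word", 4)          # C0 to C3 executed
    if right_pair:
        return ("one-word mode 2", 2)    # C2 and C3 executed
    return ("one-word mode 1", 2)        # C0 and C1 executed


modes = [
    select_mode(False, True, False, False),
    select_mode(False, True, False, True),
    select_mode(False, True, True, False),
    select_mode(True, True, False, False),
    select_mode(False, False, True, False),
]
```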

When executing a one-word instruction, for
executing the C0 or C2, the first ALU 1401 (see
Fig. 14), the first multiplier 1503 (see Fig. 15),
and the first adder 1505 (see Fig. 15) are used,
while for executing the C1 or C3, the second ALU
1402 (see Fig. 14), the second multiplier 1504 (see
Fig. 15) and the second adder 1506 (see Fig. 15)
are used. When executing a four-word instruction,
the address calculation for a load and store
operation is executed in the first ALU 1401 (see
Fig. 14), the integer operation is executed in the
second ALU 1402 (see Fig. 14), the first floating
point operation is executed in the first multiplier
1503 (see Fig. 15), the second floating point
operation is executed in the first adder 1505 (see
Fig. 15), the third floating point operation is
executed in the second multiplier 1504 (see
Fig. 15), and the fourth floating point operation
is executed in the second adder 1506 (see Fig. 15).
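
The allocation rules above may be written down as
two lookup tables, as in the following sketch. The
table names and the strings naming the operations
are hypothetical; the unit numerals are those used
in Figs. 14 and 15.

```python
# Units used by a one-word instruction, by its position in the four words.
FIRST_UNITS = ("first ALU 1401", "first multiplier 1503", "first adder 1505")
SECOND_UNITS = ("second ALU 1402", "second multiplier 1504", "second adder 1506")
ONE_WORD_UNITS = {
    "C0": FIRST_UNITS,
    "C2": FIRST_UNITS,
    "C1": SECOND_UNITS,
    "C3": SECOND_UNITS,
}

# Unit used by each operation of a four-word instruction.
FOUR_WORD_UNITS = {
    "load and store address calculation": "first ALU 1401",
    "integer operation": "second ALU 1402",
    "first floating point operation": "first multiplier 1503",
    "second floating point operation": "first adder 1505",
    "third floating point operation": "second multiplier 1504",
    "fourth floating point operation": "second adder 1506",
}

# Every operating unit is claimed by exactly one operation of a four-word
# instruction, so the six operations can proceed in the same cycle.
units_in_use = set(FOUR_WORD_UNITS.values())
```

The two one-word instructions executed together
(C0 with C1, or C2 with C3) likewise draw on
disjoint sets of units.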

The register reading control circuit 105, the
register writing control circuit 106 and the
function control circuit 107 shown in Fig. 41 serve
to generate a control signal 1318 of the integer
operating unit 1304 (see Fig. 13) by using the mode
specifying signal 110 output from the mode control
circuit and the values of the first to the fourth
instruction registers according to the foregoing
rules for allocation of the operating unit. The
further detail of the register reading control
circuit is shown in Fig. 24, which shows which
field of the operation code is used for specifying
the registers input to two inputs of each of the
six operating units. The abbreviations of the
fields are shown in the columns of the compound
instructions shown in Fig. 3. The four-word
instructions J1 and A1 are located at the position
of the one-word instruction S1 and the four-word
instructions J2 and A2 are located at the position
of the one-word instruction S2. By making use of
this locational arrangement, the field
specification shown in Fig. 24 is described by
using J1, J2, A1 and A2 when executing a one-word
instruction. This is for distinguishing C0 from C1.

Next, the description will be oriented to the 
competition detecting circuit 109 shown in Fig. 41. 
As described with respect to Figs. 7 to 9, this 
embodiment is not required to detect a competition 
between the four-word instructions. Since all the 
operating units for executing a one-word instruction 
are doubled, no competition of the operating unit 
takes place between the two one-word instructions 
executing in parallel. For simplifying the descrip- 
tion, no register competition takes place. The ex- 
pansion of this embodiment to the register com- 
petition is made easy as described in the Japanese 
Patent Application No. 63-283673. As mentioned 
above, two no-operation four-word instructions are 
required to be placed between an effective four- 
word instruction and the next one-word instruction. 
Hence, no competition takes place between the 
four-word instruction and the one-word instruction. 
It means that the competition detecting circuit just
needs to detect a competition between the one-word
instruction of the current cycle and the one-word
instructions executed before the cycle. Under the
control of the mode control circuit 100, a one-word
instruction is set to
only the first and the second instruction registers 
101 and 102. Hence, the competition detecting 
circuit 109 needs to view the first and the second 
instruction registers 101 and 102 only. It means 
that the circuit 109 does not need to view the third 
and the fourth instruction registers 103 and 104. 

Fig. 25 is a block diagram showing the embodiment
of the competition detecting circuit 109. Numerals
2501 to 2504 denote registers. A numeral 2505
denotes a mask circuit. Numerals 2506 to 2521
denote comparators. In Fig. 7, considering that the
instructions 7 and 8 are current ones, the
competition with the instructions 3 to 6 needs to
be detected. The E stage of the instructions 7 and
8 comes after the S stage of the instructions 1 and
2. Hence, no competition between the instructions
1, 2 and 7, 8 takes place. As shown in Fig. 25, the
register 2501 stores a number of the register
written by the instruction 5, the register 2503
stores a number of the register written by the
instruction 6, the register 2502 stores a number of
the register written by the instruction 3, and the
register 2504 stores a number of the register
written by the instruction 4. These four register
numbers and the numbers of the four registers read
by the instructions 7 and 8 are compared by sixteen
comparators 2506 to 2521. The compared result is
sent to the mask circuit 2505. The mask circuit
2505 serves to view the output 130 of the mode
control circuit 100 and the output 115 of the
pipeline control circuit 108 for determining
whether or not a hit signal of the comparator is
effective. If yes, the signal 116 indicating the
competition is asserted. That is, even if the
output of the comparator indicates a match of the
registers, the signal 116 is negated if the
instruction is nullified. When the mode signal 130
indicates a four-word mode, the mask circuit 2505
serves to negate the signal 116.
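
The comparison and masking performed by the circuit
of Fig. 25 may be sketched as follows. The function
name detect_competition, the use of None for an
instruction that writes no register and the two
boolean mask inputs are hypothetical simplifications
of the signals 130 and 115.

```python
def detect_competition(written, read, four_word_mode=False, nullified=False):
    """Return True when the signal 116 would be asserted.

    written -- four register numbers held in the registers 2501 to 2504
               (None when the earlier instruction writes no register)
    read    -- four register numbers read by the current pair of
               one-word instructions
    """
    # Sixteen comparisons, one per (written, read) pair.
    hits = [w is not None and w == r for w in written for r in read]
    # The mask circuit negates the signal in the four-word mode or when
    # the current instruction is nullified.
    if four_word_mode or nullified:
        return False
    return any(hits)


clash = detect_competition([1, 2, 3, 4], [5, 6, 7, 1])
masked = detect_competition([1, 2, 3, 4], [5, 6, 7, 1], four_word_mode=True)
clear = detect_competition([1, 2, 3, None], [5, 6, 7, 8])
```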

Next, the description will be directed to the
pipeline control circuit 108. The pipeline control
circuit 108 serves to accept a mode signal 130, a
flag signal 1317 from the integer operating unit
1304 shown in Fig. 13, and a flag signal 1315 from
the floating point operating unit 1307 in Fig. 13.
Further, the circuit 108 sends out a control signal
1312 of the branch unit 1302 in Fig. 13 through the
effect of the interface 1316 with the data cache
controller in Fig. 13 and the interface 1311 with
the instruction cache controller in Fig. 13 for the
purpose of controlling the branch unit. That is,
when an effective branch instruction comes to the
circuit 108, the circuit 108 serves to operate the
branch unit 1302. At the other time, by using the
mode signal 110, the circuit 108 serves to control
the program counter located in the branch unit as
shown in Fig. 23. The pipeline control circuit 108
serves to send out the signal 115 to the register
reading control circuit 105, the register writing
control circuit 106, the function control circuit
107, and the competition detecting circuit 109 for
controlling the state of the pipeline. That is, if
a miss takes place when accessing the instruction
cache or data cache, as shown in Fig. 29, the
pipeline is locked.

In turn, a first transformation of the foregoing
embodiment will be described. The foregoing
embodiment is arranged to reflect the operated
result of the four-word instruction on the fourth
instruction from the current (first) instruction
for eliminating the necessity of the competition
detecting unit between the four-word instructions.
To achieve the similar effect, the embodiment may
be arranged to reflect the operated result of the
four-word instruction on the next instruction but
to avoid the competition among the four-word
instructions by using the compiler. Concretely,
when a four-word instruction instructs to write
data in a register, the next two four-word
instructions do not read the data from that
register. This arrangement loses the effect of
substantially increasing the number of the
registers according to the first embodiment but
does not need to provide the shadow registers shown
in Fig. 19.

Further, a second transformation of the foregoing
embodiment will be described. The embodiment shown
in Fig. 3 provides in an instruction a bit
indicating if the instruction is a one-word or a
four-word one. In place of this bit, a flag
indicating if the instruction is a one-word or a
four-word one may be provided in the computer so
that this flag is controlled by an instruction.
This arrangement requires an instruction for
controlling the flag but advantageously does not
need to indicate the word length in each
instruction; the flag merely has to be switched
once.

A third transformation of this embodiment will
be described with reference to Figs. 30 to 32. As
shown in Fig. 30, this transformation is arranged to
expand the number of the floating point registers
from 32 to 128. Fig. 31 shows an instruction format
and Fig. 32 shows an instruction. The registers FR0
to FR31 are usable by both the basic instruction and
the compound instruction. The registers FR32 to
FR127 are usable only by the compound instruction.
Each of the fields for specifying the registers I1, IT,
M1, MT, A1, AT, N1, NT, B1 and BT includes seven
bits as shown in Fig. 32, which is more than in the
foregoing embodiment. This transformation is
arranged to match one of the source registers to
the target register so that the whole instruction
fits in four words. If this is not desirable, the
word length may be made longer. This transformation
is arranged to add the compound instruction to
the basic instruction so that more registers than
those used by the basic instruction can be handled
by the compound instruction. As such, the number
of usable registers can be increased.
This transformation allows FR0 to FR31 to be
accessed by both the basic instruction and the
compound instruction. Alternatively, it is possible to
provide 32 registers for the basic instruction and
128 registers for the compound instruction
independently.
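
The seven-bit field width follows from the register count, as the short Python sketch below illustrates. The helper name register_field_bits is hypothetical, and the five-bit width for 32 registers is an inference from the register count rather than a figure stated in this passage.

```python
def register_field_bits(register_count: int) -> int:
    """Smallest number of bits able to address `register_count` registers."""
    return (register_count - 1).bit_length()


basic_bits = register_field_bits(32)      # registers FR0 to FR31
compound_bits = register_field_bits(128)  # registers FR0 to FR127

# The ten register fields named in the text:
# I1, IT, M1, MT, A1, AT, N1, NT, B1 and BT.
REGISTER_FIELD_COUNT = 10
total_register_bits = REGISTER_FIELD_COUNT * compound_bits
```

Under these assumptions the ten register fields together occupy 70 bits of the compound instruction.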

Further, as shown in Fig. 32, the transformation
is arranged to specify, by the JT field, the number
of words to be prefetched when prefetching data
from the memory to the cache. With this arrangement,
a single instruction can instruct the transfer of a
plurality of blocks at a time. This yields the effect
of enhancing utilization.

In turn, a fourth transformation of the foregoing
embodiment will be described with reference to Figs.
33 and 34. The fourth transformation differs in that
prefetch of the data is carried out not by the
integer operating fields such as J1, J2, and JT but
by a one-bit P field. In the case of P=1, the block
next to the block containing an address used in the
load and store operation is prefetched. This
arrangement saves the field needed for prefetching
and offers the advantage that three operations,
namely a load and store operation, an integer
operation and a prefetch operation, can be specified
in parallel.
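
The address selection implied by the P field can be sketched in Python as follows. The block size of 32 bytes and the function name prefetch_block_address are assumptions for illustration; the text does not state a block size.

```python
from typing import Optional

BLOCK_SIZE = 32  # bytes per cache block (assumed value, not from the text)


def prefetch_block_address(
    address: int, p_bit: int, block_size: int = BLOCK_SIZE
) -> Optional[int]:
    """Return the start address of the block to prefetch, or None when
    the P field does not request a prefetch."""
    if p_bit != 1:
        return None
    # Start of the block containing the load/store address.
    current_block_start = (address // block_size) * block_size
    # With P=1 the block following that block is prefetched.
    return current_block_start + block_size


next_block = prefetch_block_address(0x1004, p_bit=1)
no_prefetch = prefetch_block_address(0x1004, p_bit=0)
```

Because the prefetch target is derived from the load and store address, no separate address field is needed, which is the saving the text describes.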

Fig. 35 is a block diagram used for explaining
another embodiment. In Fig. 35, a numeral 3500
denotes a program counter. A numeral 3501 denotes
a memory for storing instructions. A numeral
3502 denotes a mask switch circuit. Numerals 3503
to 3506 denote M instruction registers each having
an n-byte length. A numeral 3507 denotes a decoder.
Numerals 3508 and 3509 denote L (L ≥ 1)
operating units. A numeral 150 denotes an
instruction length determining unit. A numeral 109
denotes a competition detecting circuit. A numeral
100 denotes a mode control circuit. A numeral
4100 denotes a mode register.

The program counter 3500 sends out an instruction
address 3513 to the memory 3501 for storing
instructions. In the memory 3501, n-byte instructions
and (n × M)-byte instructions are stored intermixed.
The memory 3501 sends out a plurality of
instructions containing the instruction indicated by
the instruction address 3513 to the mask switch
circuit 3502. If the instruction has n bytes, the
mask switch circuit 3502 sets the instruction to at
least one of the N (1 ≤ N < M) instruction registers
3503 to 3504 among the M instruction registers. If
the instruction has n × M bytes, the mask switch
circuit 3502 sets the instruction to the instruction
registers 3503 to 3506. The decoder 3507 decodes
the instructions 3519 to 3522 from the instruction
registers 3503 to 3506 and controls one operating
unit by using the control signals 3523 and 3524.
The instruction length determining unit 150 examines
at least part of the instruction 3514 and sends out
to the mode control circuit 100 a signal 3526
indicating the instruction length. The competition
detecting circuit 109 examines the instruction
registers 3503 to 3504 and sends out to the mode
control circuit 100 a signal 116 indicating the
presence or absence of a competition between
n-byte instructions. The mode control circuit 100
determines the current mode based on the instruction
length, the presence of a competition and the value
of the program counter, and sends out a control
signal 110 for controlling the program counter, the
mask switch circuit and the decoder.
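
The interplay of the length determination, the competition detection and the program counter update can be illustrated by a simplified Python sketch. It is not the circuit of Fig. 35: the names Word, competes and issue are hypothetical, a competition is assumed to mean that a later instruction reads or writes a register written by an earlier one, the influence of the program counter value on the mode is ignored, and the constants are set to n = 4, M = 4 and N = 2, the values this document assigns to the embodiment of Figs. 1 to 29.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

N_BYTES = 4  # n: length of a short instruction in bytes
M = 4        # number of instruction registers
N = 2        # maximum number of short instructions set per cycle


@dataclass(frozen=True)
class Word:
    """One n-byte word of the instruction memory (hypothetical model)."""
    is_long: bool  # what the instruction length determining unit reports
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()


def competes(earlier: Word, later: Word) -> bool:
    """Assumed competition rule: the later short instruction touches a
    register written by the earlier one."""
    return bool(earlier.writes & (later.reads | later.writes))


def issue(words: List[Word], pc: int) -> Tuple[int, int]:
    """Return (instruction registers filled, next program counter)."""
    index = pc // N_BYTES
    if words[index].is_long:
        filled = M  # a long instruction fills all M registers
    else:
        filled = 1
        # Add further short instructions up to N, stopping at a long
        # instruction or at the first detected competition.
        while filled < N and index + filled < len(words):
            candidate = words[index + filled]
            earlier = words[index:index + filled]
            if candidate.is_long or any(competes(w, candidate) for w in earlier):
                break
            filled += 1
    return filled, pc + filled * N_BYTES


words = [
    Word(False, writes=frozenset({"R1"})),
    Word(False, reads=frozenset({"R1"})),  # competes with the word before it
    Word(False, reads=frozenset({"R2"})),
    Word(False, writes=frozenset({"R3"})),
    Word(True), Word(True), Word(True), Word(True),  # one long instruction
]
```

Calling issue(words, 0) sets a single short instruction because the second word competes with the first, while issue(words, 16) sets all four registers from the long instruction.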

Next, the correspondence between this embodiment
and the embodiments shown in Figs. 1 to
29 and 41 will be described. The embodiment
shown in Figs. 1 to 29 has been arranged on the
assumption that n = 4, M = 4, N = 2 and L = 2, and
the operating units are the integer operating unit
and the floating point operating unit. Further, the
mask switch circuit shown in Fig. 35 corresponds
to a selector for generating instructions to be set to
the first to the fourth instruction registers 101 to
104 and a mask operated by an NOP. The
competition detecting circuit 109 shown in Fig. 35
corresponds to the competition detecting circuit 109
shown in Fig. 41. The instruction length determining
unit 150 shown in Fig. 35 corresponds to the
instruction length determining unit 150 shown in
Fig. 41. The mode control circuit 100 shown in Fig.
35 corresponds to the mode control circuit 100
shown in Fig. 41.

The embodiments shown in Figs. 3, 5, 10, 11
and 12 have the shortcoming that the data location
in the memory is restricted. The embodiment
shown in Figs. 36 to 40 overcomes this
shortcoming. The compound instruction of this
embodiment is capable of executing two memory
operations, such as load or store, by one instruction.
In terms of hardware, the cache is arranged to have
two ports or to access the memory twice in one
machine cycle. Figs. 39 and 40 show a program for
solving the same problem as that shown in Figs. 11
and 12.

The embodiments described above are devised,
in a computer having both long-word instructions
and short-word instructions, to reflect an operated
result of a long-word instruction on or from any
later instruction, or to allow a long instruction to
specify a number of no-operation instructions
succeeding the next instruction. Alternatively, the
long instruction is devised to provide a first
field for transferring data from the memory or the
cache memory to the register and a second field
for transferring data from the memory to the cache
memory. These devices are also effective in a VLIW
computer having only long-word instructions.

As set forth above, this embodiment is arranged
to use a four-word instruction for specifying
seven operations with the four words. The
competition detection is carried out by 4 × 4 = 16
comparators. To detect the competition among the
four-word instructions by hardware, it would be
necessary to compare the six writing registers of
the operations, except a branch operation, of the
previous cycle and the six writing registers of the
operations of the cycle before the previous cycle
against the twelve reading registers of the
current cycle. As such, (6 + 6) × 12 = 144
comparators would be required in total. By contrast,
this embodiment has the advantage of needing only
16/144 (one ninth) of that hardware.
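
The comparator arithmetic can be checked with a few lines of Python. The helper name comparator_count is hypothetical, and treating each count as a plain product of two register counts is an assumption; for the 4 × 4 case the text gives only the two factors, not what each stands for.

```python
from fractions import Fraction


def comparator_count(first_factor: int, second_factor: int) -> int:
    """One comparator per pair of registers that must be compared."""
    return first_factor * second_factor


# Competition detection among short-word instructions: 4 x 4.
short_word_comparators = comparator_count(4, 4)

# Hardware detection among four-word instructions: the writing registers
# of the two preceding cycles (6 + 6) against twelve reading registers.
long_word_comparators = comparator_count(6 + 6, 12)

hardware_ratio = Fraction(short_word_comparators, long_word_comparators)
```

The ratio reduces to one ninth, which is the saving in comparators that the text attributes to leaving competition among four-word instructions to the compiler.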

According to this embodiment, the short-word
instructions processed in one machine cycle
indicate two operations, while the four-word
instruction indicates seven operations. Hence, a
competition detecting circuit sized for two
operations suffices to execute up to seven
operations in parallel.

The invention has an effect of increasing the 
number of operations to be executed in parallel for 
enhancing the performance. 

The invention has another effect of reducing 
the code size. This results in enhancing a hit ratio 
of the code cache, thereby enhancing the perfor- 
mance. 

The invention has another effect of facilitating
the detection, by hardware, of a competition among
the operations executed in parallel. This effect
contributes to improving the machine cycle,
reducing the amount of hardware and lowering the
cost. The effect is particularly remarkable when the
number of operations specified by a long instruction
is large.

This invention has another effect of facilitating
detection of a competition or queuing between the
instruction executed before the current cycle and
the instruction executed in the current cycle. This
effect contributes greatly to improving the
machine cycle, reducing the amount of hardware
and lowering the cost.

This invention has another effect of substantially
increasing the number of registers usable by the
software and allowing the software to be optimized
so as to increase the number of operations
executed in parallel, thereby enhancing the
performance.

The invention has another effect of keeping 
upward compatibility with the conventional architec- 
ture. 

Claims 

1. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

instruction word length determining means
for determining if said instruction indicated by
said program counter (3500) is an instruction
having a long word length or an instruction
having a short word length; and

instruction selecting means for setting said
instruction to said register (FR0 to FR31, R0 to
R31) if said instruction indicated by said
program counter (3500) is determined to be the
instruction having a long word length and
setting said instruction to a predetermined
register if said instruction indicated by said program
counter is determined to be the instruction
having a short word length.

2. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

instruction word length determining means
(150) for determining if said instruction
indicated by said program counter (3500) is an
instruction having a long word length or an
instruction having a short word length;

competition detecting means (109) for
detecting a competition among said instructions
each having a short word length;

instruction selecting means (110) for
setting said instruction to said register (FR0 to
FR31, R0 to R31) if said instruction indicated
by said program counter (3500) is determined
to be the instruction having a long word length
and setting said instruction to a predetermined
register if said instruction indicated by said
program counter is determined to be the
instruction having a short word length and no
competition is detected by said competition
detecting means.

3. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

instruction word length determining means
(150) for determining if said instruction
indicated by said program counter (3500) is an
instruction having a long word length or an
instruction having a short word length;

competition detecting means (109) for
detecting a competition among instructions each
having a short word length if said instruction is
determined to be the instruction having a short
word length by said instruction word length
determining means (150);

operating means (160) for executing a
predetermined number of said instructions having
a short word length for one machine cycle
according to the content of said competition
detecting means (109) if said instruction is
determined to be the instruction having a short
word length or executing a predetermined
number of said instructions having a long word
length for one machine cycle if said instruction
is determined to be the instruction having a
long word length.



4. A computer as claimed in Claim 1, 2 or 3,
wherein when the number of operations to be
executed in parallel is large, said instruction is
the instruction having a long word length and
when the number of operations to be executed
in parallel is small, said instruction is the
instruction having a short word length.

5. A computer as claimed in Claim 1, 2 or 3,
wherein a compiler is used for switching said
instruction having a long word length and said
instruction having a short word length.



6. A computer as claimed in Claim 1, 2 or 3,
wherein the number of operations specified in
the instruction having a long word length is
larger than the number of instructions each
having a long word length processed for one
machine cycle.



7. A computer as claimed in Claim 1, 2 or 3,
being arranged to generate an instruction train
by a compiler in a manner to avoid a
competition between the current instruction having a
long word length and the previous instruction
having a long word length and performing the
processing based on the output of said
compiler.

8. A computer as claimed in Claim 1, 2 or 3,
being arranged to insert a predetermined
number of no-operation instructions between said
effective instruction having a long word length
and said effective instruction having a short
word length, when executing said effective
instruction having a long word length after
executing said effective instruction having a short
word length and when executing said effective
instruction having a short word length after
executing said effective instruction having a
long word.

9. A computer as claimed in Claim 1, 2 or 3,
wherein said instruction having a long word
length has a function of specifying any number
of no-operation instructions following the next
instruction.

10. A computer as claimed in Claim 1, 2 or 3,
wherein the operated result of said instruction
having a long word length is adjusted to be
reflected on an instruction located on or from
any subsequent position.

11. A computer as claimed in Claim 10, wherein
assuming that N denotes a predetermined
number of said instructions placed until the
instruction where the operated result is
reflected and M denotes a number of pipeline
stages, the relation N ≥ M is established.

12. A computer as claimed in Claim 10, further
having storage means for storing the previous
content of said registers (FR0 to FR31, R0 to R31)
for a constant cycle.

13. In a computer having registers (FR0 to FR31,
R0 to R31), a memory (1200) and a program
counter (3500) and having a parallel operating
function of reading one or more instructions
from said memory (1200) indicated by said
program counter (3500) and executing an
operation indicated by said instruction with
respect to said register (FR0 to FR31, R0 to
R31), said memory (1200) and said program
counter (3500),

said computer being characterized by
reflecting an operated result of an instruction
having a long word length on or from any later
instruction.

14. A computer as claimed in Claim 13, further
having storage means for storing the previous
content of said registers (FR0 to FR31, R0 to
R31) for a constant cycle.

15. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

said instruction being an instruction having
a short word length for indicating a single
operation or an instruction having a long word
length for indicating a plurality of operations;

means (150) for determining if said
instruction indicated by said program counter (3500)
is said instruction having a short word length
or said instruction having a long word length;
and

means (160) for executing any number of
instructions for one machine cycle if said
instruction is determined to be the instruction
having a short word length or executing one
instruction for one machine cycle if said
instruction is determined to be the instruction
having a long word length.

16. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

said instruction being an instruction having
a short word length for indicating a single
operation or an instruction having a long word
length for indicating a plurality of operations;

means (150) for determining if said
instruction indicated by said program counter (3500)
is said instruction having a short word length
or said instruction having a long word length;

means (109) for detecting a competition
between said instructions each having a short
word length;

means (160) for executing one instruction
for one machine cycle if said instruction is
determined to have a long word length or
executing any number of instructions until said
competition is solved if any, according to the
result of said competition detecting means
(109) if said instruction is determined to have a
short word length, or executing any number of
instructions if any competition is detected.



17. A computer having:

registers (FR0 to FR31, R0 to R31);

a memory (1200);

a cache memory (1300);

a program counter (3500);

a parallel operating function of reading one
or more instructions from said memory (1200)
indicated by said program counter (3500) and
executing an operation indicated by said
instruction with respect to said register (FR0 to
FR31, R0 to R31), said memory (1200) and
said program counter (3500);

said instruction containing:

a first field for indicating a data transfer
from said memory (1200) or said cache
memory (1300) to said register; and

a second field for indicating a data transfer
from said memory (1200) to said cache
memory (1300).